Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on QUERCUS and also distributed via email. The teaching materials consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will let you spend your time coding rather than taking notes!
As we go along, there will be some in-class challenge questions for you to solve, either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark).
We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to the second lecture in a series of six. Today you will dive into more detailed data structures, the packages that work with them, and build up towards a more standard data science structure: the DataFrame.
At the end of this lecture we will aim to have covered the following topics:
NumPy package and arrays
Pandas package and DataFrames

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.
random is a package with methods to add pseudorandomness to programs
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# import random
# import numpy as np
# import pandas as pd
As discussed in lecture 1, everything in Python is an object:
The above are all objects and also data types in Python. We can store these data types in data structures to properly store, format, and model our data. The choice of data structure depends on the objective(s), the data types, and the tasks to perform. For example, some data structures can handle only one data type at a time (all numeric or all character) but are computationally very fast. Other structures can store several data types but are computationally expensive and slow, especially with large datasets (thousands of rows and columns).
Another feature to look for in data structures is their mutability: some structures can be altered after they are created (mutable), some cannot (immutable). Let's take a look at some of Python's core data types (built-in data structures).
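As a quick illustration of mutability (a minimal sketch using a list and a string, which we met in lecture 1):

```python
# Lists are mutable: item assignment works in place
mutable = ["a", "b", "c"]
mutable[0] = "z"
print(mutable)  # ['z', 'b', 'c']

# Strings are immutable: item assignment raises a TypeError
immutable = "abc"
try:
    immutable[0] = "z"
except TypeError as error:
    print("Cannot modify a string:", error)
```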
The first Python object that we will introduce is the list: an ordered collection of data of one or several types (strings, booleans, etc.) where each datum is called an element or item. Lists are easily identifiable because of the square brackets that enclose their elements.
# Make a basic list. What data types are these?
[10, 20, 30, 40]
[10, 20, 30, 40]
# Make another mono-type list.
[order, family, genus, species]
# What data type am I attempting to store and why is it not working?
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-3-1b9fbd26d06d> in <module> 1 # Make another mono-type list. 2 ----> 3 [order, family, genus, species] 4 5 # What data type am I attempting to store and why is it not working? NameError: name 'order' is not defined
# Make a proper mono-type list of strings
["order", "family", "genus", "species"]
['order', 'family', 'genus', 'species']
When working with different data types and information that you'd like to pass around, it's convenient to know that you can combine this information into a single list. As mentioned above, a list's elements can be of arbitrary types, so as long as you remember the order of your elements, you can put a variety of data types into the same list object without being subject to coercion.
Let's try!
# Make a list of arbitrary elements
["order", 20, 30.5, True]
['order', 20, 30.5, True]
list() to initialize an empty list

When writing flow control programs (lecture 4), we need to create empty structures in advance so the program has a place to write its output. This is called "initialization", and to create an empty list we use the list() function. In fact, all classes (which make objects) should have some kind of initializer to create an object, even if the objects are essentially "empty" containers at their outset.
[] # this is an empty list
list() # another way to create an empty list
[]
[]
Remember you can inquire about the class of an object with type()
print(type([]))
<class 'list'>
So far we've just been making lists that disappear from memory but you may want to make a list that you can pass around, grow, shrink, or pull information from.
# Make some lists for us to look at
list_1 = ["genome", 20, 30.5, True]
list_2 = [1, 100, 10, 25]
list_3 = [3]
# Take a look at those lists individually
list_1
list_2
list_3
# Print out all the lists in a single line. Look carefully at how they are displayed!
print(list_1, list_2, list_3)
['genome', 20, 30.5, True]
[1, 100, 10, 25]
[3]
['genome', 20, 30.5, True] [1, 100, 10, 25] [3]
[ ]

Items in a list can be accessed using square brackets, the same way we did with strings (lecture 1). Many of the same indexing methods and mechanisms work the same way as with strings. Recall that we use the [index] syntax to access a single element of our list, which is also zero-indexed.
# Access the first element of list_1
list_1[0]
'genome'
The items in a list can be modified by indexing the item that we want to change (remember that lists are mutable)
list_1 [0] = "genomics" # The space between list_1 and [0] is not needed but improves readability
list_1
['genomics', 20, 30.5, True]
# You can also use negative indices!
list_1[-3]
20
in

We can ask if a single element is present within a list using the in keyword. This can be a quick way to determine if your list has the element or item you are seeking.
# Check for "genomics" in your list
"genomics" in list_1
True
"genome" in list_1
False
We can perform mathematical operations on lists and between lists. Let's explore list_2 using some built-in functions:

len() (length)
max() (maximum)
min() (minimum)
sum() (total)

len(list_2) # 4 elements long
max(list_2) # the max number in list_2 is 100
min(list_2) # the min number is 1
sum(list_2) / len(list_2) # The mean of list_2 as a float
sum(list_2) // len(list_2) # The mean of list_2 as an integer. Notice the double forward slash!
4
100
1
34.0
34
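If you'd rather not compute the mean by hand, the standard library's statistics module does it for you (a small sketch; this module is not one of the packages imported for class):

```python
import statistics

list_2 = [1, 100, 10, 25]
statistics.mean(list_2)    # arithmetic mean: 34
statistics.median(list_2)  # middle value: 17.5 here (average of 10 and 25)
```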
Yes, like most data types you can use basic operators on lists but what will their behaviour be? Remember from last week that we saw different behaviours between numbers and strings! Let's explore further.
+ operator to concatenate two or more lists

Much like with strings, the + operator takes on a different behaviour when working with lists versus numbers. Rather than being interpreted as an addition operation, it is a concatenation symbol, allowing you to combine two or more lists together.
# Remind us, what is in list_2?
list_2
# Make another list of integers
list_4 = [ 2, 20, 200]
# Combine the two lists
list_5 = list_4 + list_2
list_5
[1, 100, 10, 25]
[2, 20, 200, 1, 100, 10, 25]
* operator to repeat your list

Again we see a separate behaviour for a mathematical operator given the context of a list data structure. If you'd like to repeat your list one or more times, you can use the * operator.
# Repeat 22 five times
[22] * 5
# Repeat an entire list three times
list_2 * 3
[22, 22, 22, 22, 22]
[1, 100, 10, 25, 1, 100, 10, 25, 1, 100, 10, 25]
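A common use of * is pre-sizing a list: making a placeholder of a fixed length before filling it in (a small sketch; the variable name scores is hypothetical):

```python
# Pre-fill a list of 8 placeholder scores with zeros
scores = [0] * 8
print(scores)       # [0, 0, 0, 0, 0, 0, 0, 0]
print(len(scores))  # 8
```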
[start:end]

I think we are seeing a pattern now between lists and strings (why do you think that is?). We can slice our lists using the same notation we learned in lecture 1. Just remember that we are working with [inclusive:exclusive] form. So what does that really mean? Let's review.
# Full list
list_2
# grab the 2nd, and 3rd element
list_2[1:3]
# Grab from the 1st to the 3rd element (index 0-2)
list_2[:3]
# Grab from the 3rd to the last element (index 2-3)
list_2[2:]
# What does this get us?
list_2[2:3]
[1, 100, 10, 25]
[100, 10]
[1, 100, 10]
[10, 25]
[10]
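Slices also accept an optional third number, a step, giving [start:end:step]. This isn't covered above, but here is a brief sketch:

```python
list_2 = [1, 100, 10, 25]
list_2[::2]   # every second element: [1, 10]
list_2[::-1]  # a step of -1 walks backwards, reversing the list: [25, 10, 100, 1]
```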
random.seed() command to set up reproducible "random" sequences

Let's try something more "practical" by generating a random sequence of nucleotides and working with that. We'll introduce the seed(n) function from the random package. We'll also use the sample(sequence, n) function to take n items from our list without replacement. Remember how to access functions from packages?
Furthermore, we'll be seeing our first use of a for loop but we'll dig deeper into that in lecture 4 (flow control).
The only random things you'll find in computer science are whether your programs will run on the first try and whether you'll understand them 6 months from now.
More seriously, generating a truly random sequence of numbers in software is not possible. We can approximate randomness, especially with special hardware, but from a software perspective we can only mimic stochastic processes. Generally our approximations, or pseudorandom algorithms, are, to the casual observer, just as good as truly random events. They are, however, deterministic and can be repeated if we know the start state of the process. A random number generator typically uses something like the system time as a seed to initialize its state, but if we use a specific seed, we get repeatable results.
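You can see this determinism directly: sampling twice after setting the same seed produces identical results (a minimal sketch using random.sample; the seed value 42 is arbitrary):

```python
import random

nucleotides = ["A", "C", "G", "T"]

random.seed(42)
first_draw = random.sample(nucleotides, 3)

random.seed(42)  # reset the generator to the same start state
second_draw = random.sample(nucleotides, 3)

print(first_draw == second_draw)  # True: same seed, same "random" picks
```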
# Create a list containing the DNA nucleotides
nucleotides = ["A", "C", "G", "T"]
type(nucleotides)
# Random is a package with methods to add pseudorandomness to programs
import random
random.seed(3) # Seed setting is a good practice to make code reproducible when code involves randomness
gene_sequence = [ ] # Initialize with an empty list
# Example of what a for loop looks like (more about flow control in lecture 4, so do not worry about the code right now)
for base in nucleotides:
gene_sequence = gene_sequence + random.sample(nucleotides, 3)
gene_sequence
type(gene_sequence)
list
['C', 'G', 'A']
['C', 'G', 'A', 'G', 'T', 'C']
['C', 'G', 'A', 'G', 'T', 'C', 'A', 'G', 'T']
['C', 'G', 'A', 'G', 'T', 'C', 'A', 'G', 'T', 'T', 'C', 'A']
list
# Just a little command to make our output look better in Jupyter!
%pprint
# Let's add a second gene_sequence statement to extend our DNA sequence.
# This time, let's take the print statement out of the for loop
random.seed(3)
gene_sequence = [] # Initialize with an empty list
for base in nucleotides:
gene_sequence = gene_sequence + random.sample(nucleotides, 3)
gene_sequence = gene_sequence + random.sample(gene_sequence, 3)
# gene_sequence
gene_sequence
type(gene_sequence)
# We'll be playing around with gene_sequence quite a bit so let's make a copy
gene_sequence_copy = list(gene_sequence)
Pretty printing has been turned OFF
['C', 'G', 'A', 'G', 'A', 'C', 'A', 'C', 'G', 'G', 'G', 'G', 'T', 'G', 'C', 'A', 'G', 'A', 'C', 'G', 'A', 'G', 'T', 'C']
<class 'list'>
Slice the first 10 nucleotides which are CGAGACACGG
gene_sequence[1:10]
['G', 'A', 'G', 'A', 'C', 'A', 'C', 'G', 'G']
Why is the first cytosine missing?
# Remember that Python uses zero-based indexing and the last index in a range is excluded (in this case, index 10)
gene_sequence[0:10]
['C', 'G', 'A', 'G', 'A', 'C', 'A', 'C', 'G', 'G']
Can you think of another way to index the first 10 nucleotides?
gene_sequence[:10]
['C', 'G', 'A', 'G', 'A', 'C', 'A', 'C', 'G', 'G']
Analogously, omitting the second index returns elements all the way to the last element.
gene_sequence[12:] # with this syntax, the last element is included
['T', 'G', 'C', 'A', 'G', 'A', 'C', 'G', 'A', 'G', 'T', 'C']
Slicing can also be used to update several elements at a time. Replace the bases G and A (bases number 2 and 3 of gene_sequence) with R (in a DNA sequence, R means that either adenine or guanine (puRines) can be found at that position).
gene_sequence
#Replace the 2nd and 3rd element of your list with R
gene_sequence[2:3] = ["R", "R"]
gene_sequence
['C', 'G', 'R', 'R', 'G', 'A', 'C', 'A', 'C', 'G', 'G', 'G', 'G', 'T', 'G', 'C', 'A', 'G', 'A', 'C', 'G', 'A', 'G', 'T', 'C']
# Notice that we're making a new list with the list() command?
gene_sequence = list(gene_sequence_copy)
Now, run the code below several times (press Ctrl/Command + Enter 10 times consecutively). Pay attention to the printed output. Do you see what is happening?
gene_sequence
# Replace elements 2 and 3 with just a single R
gene_sequence[1:3] = ["R"]
gene_sequence
# This time, we are telling Python that we want to replace indices 1 and 2 with just one element, "R".
# That is why gene_sequence becomes shorter and shorter as you run the code over and over again
['C', 'R', 'C', 'G', 'G', 'G', 'G', 'T', 'G', 'C', 'A', 'G', 'A', 'C', 'G', 'A', 'G', 'T', 'C']
['C', 'R', 'G', 'G', 'G', 'G', 'T', 'G', 'C', 'A', 'G', 'A', 'C', 'G', 'A', 'G', 'T', 'C']
Okay, let's go back to our task. Before that, though, go ahead and copy gene_sequence_copy one more time
# Notice that we're making a new list with the list() command?
gene_sequence = list(gene_sequence_copy)
Back to our task: to change the second and third nucleotides in gene_sequence, we should start at index 1, not at index 2. Again: Python uses zero-based indexing.
# Notice that we're making a new list with the list() command?
gene_sequence = list(gene_sequence_copy)
print("gene_sequence before modification: " + str(gene_sequence[0:10]))
# Define the indices of interest and, on the right side of the equals sign, provide a list of the elements that you want to incorporate
gene_sequence[1:3] = ["R", "R"]
print("gene_sequence after modification: " + str(gene_sequence[0:10]))
gene_sequence before modification: ['C', 'G', 'A', 'G', 'A', 'C', 'A', 'C', 'G', 'G'] gene_sequence after modification: ['C', 'R', 'R', 'G', 'A', 'C', 'A', 'C', 'G', 'G']
join() method to concatenate list elements

Naturally, the most appropriate way to store a gene sequence is as a string, not as a list. Let's type-convert gene_sequence into a string with the join() string method, which takes the form separator.join(sequence) where
sequence is the given list of elements
separator is the character, or string, that you wish to occur between the elements of your sequence

Let's give it a try!
# Make gene_sequence a string
# The quotes are empty because we don't want any separator characters between the joined elements
gene_sequence = "".join(gene_sequence)
# Now we have a gene sequence that we can use
gene_sequence
type(gene_sequence)
# print("gene_sequence is now of " + str(type(gene_sequence)) + " (type string)")
'CRRGACACGGGGTGCAGACGAGTC'
<class 'str'>
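If you do want characters between the elements, put them inside the quotes. A quick sketch with a hypothetical codon list:

```python
codons = ["GCA", "GCC", "GCG"]
"-".join(codons)   # 'GCA-GCC-GCG'
", ".join(codons)  # 'GCA, GCC, GCG'
```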
Time to move on to other aspects of working with lists. Do you recall what methods are? Python objects, such as lists, have methods (behaviours) that can be applied to them to carry out tasks. We access these with the . syntax, much like functions from a package. In this case, however, we are accessing methods that belong to the object referenced by our variable. Thus our general syntax is variable.method().
Let's say that we want to add "biology" as an element to list_1. We know that the + operator concatenates two lists together, so we need to convert "biology" into a list and then concatenate the two lists together. Easy peasy, right?
# Let's revisit list_1
list_1
# Create a new list with the word "biology"
bio = list("biology")
list_1 + bio
['genomics', 20, 30.5, True]
['genomics', 20, 30.5, True, 'b', 'i', 'o', 'l', 'o', 'g', 'y']
list() breaks up your string!

As you can see above, our type conversion or casting of the string "biology" resulted in a list of the single characters that make up the string. Not exactly the behaviour we wanted! Instead we could cast the string "biology" as a list with the [] operators using:
bio = list(["biology"]) or bio = ["biology"]
bio = list(["biology"])
bio
bio = ["biology"]
bio
# Now you can successfully add it to the list!
['biology']
['biology']
.append() or .extend() methods to quickly add to your list without type-casting!

Rather than spend extra code instantiating a variable and type-casting it to a proper list, you can use the .append() method to add a single element to your list. To add multiple elements, use the .extend() method instead.
list_1 = ["genome", 20, 30.5, True]
list_1.append("biology")
list_1
['genome', 20, 30.5, True, 'biology']
# Let's add list_2 to list_1
list_1.extend(list_2)
list_1
['genome', 20, 30.5, True, 'biology', 1, 100, 10, 25]
Notice that list_1 was modified while list_2 remains intact
list_2
[1, 100, 10, 25]
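The difference between the two methods matters: .append() adds its argument as a single element (even if that argument is itself a list), while .extend() adds each element individually. A quick sketch:

```python
appended = [1, 2]
appended.append([3, 4])  # the whole list becomes ONE nested element
print(appended)          # [1, 2, [3, 4]]

extended = [1, 2]
extended.extend([3, 4])  # each element is added separately
print(extended)          # [1, 2, 3, 4]
```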
.sort() method on your list to order the elements

Another method, .sort(), lets you sort the elements in a list. As a caveat, .sort() requires all the elements to be of the same type in order to make proper comparisons. It can come in the form myList.sort(reverse = (True|False), key = myFunc) where:

reverse (optional) determines whether you sort in descending (True) or ascending (False) order
key (optional) can be a function that determines how to sort your elements

Note that as a method, .sort() will permanently change your list, just like .append() or .extend()!
# Assign bio as a list
bio = list("biology")
bio
# Apply bio.sort()
bio.sort()
# print bio after sorting it
bio # Now bio has changed (mutated)
# sort list_2
list_2.sort()
list_2
['b', 'i', 'o', 'l', 'o', 'g', 'y']
['b', 'g', 'i', 'l', 'o', 'o', 'y']
[1, 10, 25, 100]
.sort()'s default behaviour is .sort(reverse = False) (lowest to highest), but it can be overridden. How would you call .sort() to get the reverse order?
# Let's do a reverse sort on list_2
list_2.sort(reverse = True)
list_2
[100, 25, 10, 1]
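The key argument mentioned above takes a function that is applied to each element before comparing. For instance, sorting strings by their length using the built-in len (a sketch; the list of strings is hypothetical):

```python
words = ["Ala", "A", "GCAA", "GC"]
words.sort(key=len)  # shortest to longest
print(words)         # ['A', 'GC', 'Ala', 'GCAA']

words.sort(key=len, reverse=True)  # longest to shortest
print(words)         # ['GCAA', 'Ala', 'GC', 'A']
```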
We've seen how to add and replace elements within your list but sometimes you want to delete an element completely. Let's look at some of the ways to accomplish this.
.pop() method

The .pop(index) method will return the removed element while altering the list object. This way you can save that element, move or copy it to another list, or run analyses on the value/object as needed.
print(list_1) # before
reordered_dropped = list_1.pop(0) # pop (dropped element) is stored as a separate object in case we need it
print(reordered_dropped)
print(list_1) # after
['genome', 20, 30.5, True, 'biology', 1, 100, 10, 25] genome [20, 30.5, True, 'biology', 1, 100, 10, 25]
del() to remove an element

If you are not interested in the removed value, you can use del. Note how we aren't using the dot notation from above? del is actually a statement, not a method, although it also accepts parentheses. Both forms of syntax work, and del works with slice notation as well!
list_4 = [2, 20, 200]
print(list_4)
# del is a statement, not a method, which is why it does not use dot notation
del(list_4[1])
# Alternatively, the same thing without parentheses:
# del list_4[1]
print(list_4)
[2, 20, 200] [2, 200]
.remove() method to remove an element without knowing its index

More often than not, we remember which element we want to drop but not its index. The .remove() method will find and delete the first occurrence of the element given as its argument.
Let's remove the number 100 from list_2, this time with .remove():
# Let's set our list again
list_2 = [1, 100, 10, 25]
# Use the .remove() method
list_2.remove(100)
print(list_2)
[1, 10, 25]
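One caveat worth knowing: .remove() raises a ValueError if the element isn't present, so guarding with an in membership check first is a safe pattern (a small sketch):

```python
list_2 = [1, 10, 25]
if 100 in list_2:       # guard against a ValueError
    list_2.remove(100)
print(list_2)           # unchanged: [1, 10, 25]

try:
    list_2.remove(100)  # removing a missing element fails loudly
except ValueError as error:
    print("Error:", error)
```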
.split() method to break a string into a list of strings

Sometimes you may have a complex string that you'd like to break up based on a pattern, specific character, delimiter, etc. To accomplish this we can use the string method .split(). The output returned is a list of strings, regardless of whether or not the pattern itself is found.
Now, let's subset "biology" from list_1, then use .split() to split "biology" wherever there is an "o".
list_1
# Isolate the right element and split it
print(list_1[3].split("o")) # Biology is at index 3
# This is essentially what is happening
"biology".split("o")
# What happens to list_1? It remains unaltered
list_1
[20, 30.5, True, 'biology', 1, 100, 10, 25]
['bi', 'l', 'gy']
['bi', 'l', 'gy']
[20, 30.5, True, 'biology', 1, 100, 10, 25]
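The same method is handy for delimiter-separated data. A sketch with a hypothetical comma-separated record, plus the no-argument form, which splits on any run of whitespace:

```python
record = "Alanine,Ala,A"
record.split(",")    # ['Alanine', 'Ala', 'A']

sentence = "order  family genus"
sentence.split()     # no argument: split on runs of whitespace
# ['order', 'family', 'genus']
```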
As we mentioned at the beginning, a list can hold an arbitrary grouping of data types/objects/elements. That means we can make a list within a list, also known as a nested list. When it comes to making more complex structures, this means we can give a structure hierarchy using lists.
# Each bracket subsets the list
# We can use : notation to copy a list's contents (references) too
list_6 = list_1[:]
list_6.append(["extra", "elements", 3])
list_6
list_1
# How do we just add on that extra list to list_1?
[20, 30.5, True, 'biology', 1, 100, 10, 25, ['extra', 'elements', 3]]
[20, 30.5, True, 'biology', 1, 100, 10, 25]
# Let's make a nested list of amino acid information
list_aminoacids = [["Alanine", "Ala", "A"],
["Arginine", "Arg", "R"],
["Asparagine", "Asn", "N"],
["Aspartic acid", "Asp", "D"]]
list_aminoacids
[['Alanine', 'Ala', 'A'], ['Arginine', 'Arg', 'R'], ['Asparagine', 'Asn', 'N'], ['Aspartic acid', 'Asp', 'D']]
[ ][ ] to access individual elements of your nested list

In order to access specific parts of your list, you'll need to remember its structure and use that information. When there is an ordered pattern to your data structure, it can be easier to generate code for traversing the nested list. The [] operators begin their access at the top-most level of the nested list and move downward through each level.
For a 2D nested list, you can think of it like a table or spreadsheet where you access using a [row][column] syntax.
list_aminoacids[2][0] # the first index is the row number and the second is the column
list_aminoacids[0][1]
# You can slice nested lists, but only kind of.
list_aminoacids[1:3]
'Asparagine'
'Ala'
[['Arginine', 'Arg', 'R'], ['Asparagine', 'Asn', 'N']]
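Using a for loop (previewed earlier; details in lecture 4), we can walk the rows of the nested list and pull out one "column", say the three-letter abbreviations. A sketch assuming the list_aminoacids structure above:

```python
list_aminoacids = [["Alanine", "Ala", "A"],
                   ["Arginine", "Arg", "R"],
                   ["Asparagine", "Asn", "N"],
                   ["Aspartic acid", "Asp", "D"]]

abbreviations = []                # initialize an empty list for the output
for row in list_aminoacids:
    abbreviations.append(row[1])  # index 1 is the "column" of abbreviations
print(abbreviations)              # ['Ala', 'Arg', 'Asn', 'Asp']
```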
You are interested in finding all "GCC" codons and their location in list_alanine. Tip: Run dir(list_alanine) to see a list of all the attributes and methods that are part of list list_alanine.
list_alanine = ["GCA", "GCC", "GCG", "GCU"]
print("Methods available for lists: " + str(dir(list_alanine)))
Methods available for lists: ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
# is count a method or an attribute?
getattr(list_alanine, "count")
<built-in method count of list object at 0x000001BD547579C0>
list_alanine
# First, a quick check to make sure that "GCC" is actually present in alanine
"GCC" in list_alanine
# Count the number of GCC codons with .count()
# Note that you can get the output without using the print() command because of our IPython changes
print("The codon GCC is present " + str(list_alanine.count("GCC")) + " time(s) in list_alanine")
# Locate GCC in list_alanine with .index()
print("The codon GCC can be found at index " + str(list_alanine.index("GCC")))
['GCA', 'GCC', 'GCG', 'GCU']
True
The codon GCC is present 1 time(s) in list_alanine The codon GCC can be found at index 1
Here is a more detailed description of the list methods and what they do:
| Method call | Description | Alters the list? |
|---|---|---|
| append() | Add an element to the end of the list | Yes |
| extend() | Add all elements of a list to another list | Yes |
| insert() | Insert an item at the defined index | Yes |
| remove() | Removes an item from the list | Yes |
| pop() | Removes and returns an element at the given index | Yes |
| clear() | Removes all items from the list | Yes |
| sort() | Sort items in a list in ascending order | Yes |
| reverse() | Reverse the order of items in the list | Yes |
| index() | Returns the index of the first matched item | No |
| count() | Returns the count of number of items passed as an argument | No |
| copy() | Returns a shallow (new memory) copy of the list | No |
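A few of the table's methods that we haven't demonstrated yet, sketched on a throwaway list:

```python
demo = [1, 10, 25]
demo.insert(1, 5)     # insert 5 at index 1
print(demo)           # [1, 5, 10, 25]

demo.reverse()        # reverse the order in place
print(demo)           # [25, 10, 5, 1]

backup = demo.copy()  # shallow copy: a new list in new memory
demo.clear()          # empty the original
print(demo, backup)   # [] [25, 10, 5, 1]
```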
So far, we have seen that lists are very useful for storing diverse types of data in a single structure. However, lists are not that convenient when we need to extract data in groups or elements that "belong together". Let's revisit our nested list list_aminoacids from section 1.11.0. What if you were interested in getting all data associated with aspartic acid? Do you remember what index Asp was at? In this circumstance, we are better off using dictionaries.
Dictionaries are similar to lists in that they can also take (almost) any data type. Unlike lists, however, dictionaries are not accessed by position. Instead, dictionaries map a key to the value(s) related to that key, i.e. key:value pairs that belong together. In our example, "Aspartic acid" is the key and "Asp" is the value.
Dictionaries are created using { } (curly brackets), a feature that also makes them easily identifiable. Unlike lists, though, dictionary keys are immutable and entries are looked up by key rather than by numeric index. As a side note, Python dictionaries are usually called hashes in other programming languages. In summary: dictionaries are a key:value system.

For starters, let's create a dictionary called aminoacids_dict.
# Create a key:value pairing of amino acids
aminoacids_dict = {"Alanine": "Ala",
"Arginine": "Arg",
"Asparagine": "Asn",
"Aspartic acid": "Asp"}
aminoacids_dict
{'Alanine': 'Ala', 'Arginine': 'Arg', 'Asparagine': 'Asn', 'Aspartic acid': 'Asp'}
Now that we created a dictionary, let's access its keys and values
aminoacids_dict["alanine"]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-60-6b0581c15783> in <module> ----> 1 aminoacids_dict["alanine"] KeyError: 'alanine'
What is the problem? I am 100% certain that alanine is in that dictionary...
# Remember that Python is case sensitive
aminoacids_dict["Alanine"]
'Ala'
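To look up a key without risking a KeyError, dictionaries also offer a .get() method, which returns a default (None unless you supply one) when the key is missing. A short sketch on a small dictionary:

```python
aminoacids_dict = {"Alanine": "Ala", "Arginine": "Arg"}

aminoacids_dict.get("Alanine")               # 'Ala'
aminoacids_dict.get("alanine")               # None: missing key, no error raised
aminoacids_dict.get("alanine", "not found")  # supply your own default value
```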
dict() function to initialize an empty dictionary

Analogously to lists, empty dictionaries can be created with the dict() function or just with { }. These can be assigned to variables when needed.
{} # This is an empty dictionary
dict() # Another way to create an empty dictionary
{}
{}
# Add Cysteine and Glutamic acid to our dictionary
aminoacids_dict["Cysteine"] = "Cys"
aminoacids_dict["Glutamic"] = "Glu"
aminoacids_dict
{'Alanine': 'Ala', 'Arginine': 'Arg', 'Asparagine': 'Asn', 'Aspartic acid': 'Asp', 'Cysteine': 'Cys', 'Glutamic': 'Glu'}
.update() method

If you have more than one entry you'd like to add to your dictionary at once, you can essentially create a second dictionary to add through the .update() method. Remember that {key1: value1, key2: value2} initializes a new dictionary.
# Let's add four more entries to our dictionary in a single call.
aminoacids_dict.update({"Glutamine": "Gln",
"Glycine": "Gly",
"Histidine": "His",
"Isoleucine": "Ile"})
aminoacids_dict
{'Alanine': 'Ala', 'Arginine': 'Arg', 'Asparagine': 'Asn', 'Aspartic acid': 'Asp', 'Cysteine': 'Cys', 'Glutamic': 'Glu', 'Glutamine': 'Gln', 'Glycine': 'Gly', 'Histidine': 'His', 'Isoleucine': 'Ile'}
Now, let's rebuild our amino acid database, this time with the respective symbols and encoding codons stored as sets. The main differences between sets and lists are:

Sets are unordered and cannot contain duplicate elements.
Sets use the set() or {} syntax to initialize.

aminoacids_dict = {"Alanine": {"Ala", "A", "GCA GCC GCG GCT"},
"Cysteine": {"Cys", "C", "TGC TGT"},
"Aspartic acid": {"Asp", "D", "GAC GAT"},
"Glutamic acid": {"Glu", "E", "GAA GAG"},
"Phenylalanine": {"Phe", "F", "TTC TTT"},
"Glycine": {"Gly", "G", "GGA GGC GGG GGT"},
"Histidine": {"His", "H", "CAC CAT"},
"Isoleucine": {"Ile", "I", "ATA ATC ATT"},
"Lysine": {"Lys", "K", "AAA AAG"},
"Leucine": {"Leu", "L", "TTA TTG CTA CTC CTG CTT"},
"Methionine": {"Met", "M", "ATG"},
"Asparagine": {"Asn", "N", "AAC AAT"},
"Proline": {"Pro", "P", "CCA CCC CCG CCT"},
"Glutamine": {"Gln", "Q", "CAA CAG"},
"Arginine": {"Arg", "R", "AGA AGG CGA CGC CGG CGT"},
"Serine": {"Ser", "S", "AGC AGT TCA TCC TCG TCT"},
"Threonine": {"Thr", "T", "ACA ACC ACG ACT"},
"Valine": {"Val", "V", "GTA GTC GTG GTT"},
"Tryptophan": {"Trp", "W", "TGG"},
"Tyrosine": {"Tyr", "Y", "TAC TAT"}
}
# What is the typing of our new dictionary?
type(aminoacids_dict)
# Print out our dictionary
aminoacids_dict
<class 'dict'>
{'Alanine': {'Ala', 'GCA GCC GCG GCT', 'A'}, 'Cysteine': {'TGC TGT', 'C', 'Cys'}, 'Aspartic acid': {'GAC GAT', 'Asp', 'D'}, 'Glutamic acid': {'GAA GAG', 'Glu', 'E'}, 'Phenylalanine': {'Phe', 'F', 'TTC TTT'}, 'Glycine': {'GGA GGC GGG GGT', 'Gly', 'G'}, 'Histidine': {'CAC CAT', 'His', 'H'}, 'Isoleucine': {'ATA ATC ATT', 'I', 'Ile'}, 'Lysine': {'Lys', 'AAA AAG', 'K'}, 'Leucine': {'L', 'Leu', 'TTA TTG CTA CTC CTG CTT'}, 'Methionine': {'Met', 'M', 'ATG'}, 'Asparagine': {'AAC AAT', 'Asn', 'N'}, 'Proline': {'Pro', 'P', 'CCA CCC CCG CCT'}, 'Glutamine': {'Gln', 'Q', 'CAA CAG'}, 'Arginine': {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}, 'Serine': {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}, 'Threonine': {'ACA ACC ACG ACT', 'T', 'Thr'}, 'Valine': {'GTA GTC GTG GTT', 'Val', 'V'}, 'Tryptophan': {'Trp', 'TGG', 'W'}, 'Tyrosine': {'Tyr', 'Y', 'TAC TAT'}}
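Because the values are sets, membership tests with the in keyword are fast and natural. A sketch assuming one entry of the aminoacids_dict built above:

```python
aminoacids_dict = {"Alanine": {"Ala", "A", "GCA GCC GCG GCT"}}

"A" in aminoacids_dict["Alanine"]    # True: the one-letter symbol is in the set
"Ala" in aminoacids_dict["Alanine"]  # True
"Gly" in aminoacids_dict["Alanine"]  # False: that's Glycine's abbreviation
```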
As you can see, Python objects can be overwritten, so be careful when you create new variables. Immutability, in the case of dictionaries, refers to the keys: entries can be added or deleted, but an existing key cannot itself be changed in place.
.update() method as well

Not only can you add new entries with .update(), but you can also (as seen in our previous example) update the key:value pairs within a dictionary.
# Replace the Alanine values in our dictionary with .update()
aminoacids_dict.update({"Alanine": 'Changed values'})
aminoacids_dict
{'Alanine': 'Changed values', 'Cysteine': {'TGC TGT', 'C', 'Cys'}, 'Aspartic acid': {'GAC GAT', 'Asp', 'D'}, 'Glutamic acid': {'GAA GAG', 'Glu', 'E'}, 'Phenylalanine': {'Phe', 'F', 'TTC TTT'}, 'Glycine': {'GGA GGC GGG GGT', 'Gly', 'G'}, 'Histidine': {'CAC CAT', 'His', 'H'}, 'Isoleucine': {'ATA ATC ATT', 'I', 'Ile'}, 'Lysine': {'Lys', 'AAA AAG', 'K'}, 'Leucine': {'L', 'Leu', 'TTA TTG CTA CTC CTG CTT'}, 'Methionine': {'Met', 'M', 'ATG'}, 'Asparagine': {'AAC AAT', 'Asn', 'N'}, 'Proline': {'Pro', 'P', 'CCA CCC CCG CCT'}, 'Glutamine': {'Gln', 'Q', 'CAA CAG'}, 'Arginine': {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}, 'Serine': {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}, 'Threonine': {'ACA ACC ACG ACT', 'T', 'Thr'}, 'Valine': {'GTA GTC GTG GTT', 'Val', 'V'}, 'Tryptophan': {'Trp', 'TGG', 'W'}, 'Tyrosine': {'Tyr', 'Y', 'TAC TAT'}}
= operator to also update key:value pairs¶As with .update(), we can directly access a key:value pair and alter the value using the dictionary[key]=value syntax.
Let's bring back the Alanine values 'Ala', 'A', and 'GCA GCC GCG GCT'
# Revert the Alanine values in our dictionary using =
aminoacids_dict["Alanine"] = {'Ala', 'A', 'GCA GCC GCG GCT'}
aminoacids_dict
{'Alanine': {'Ala', 'GCA GCC GCG GCT', 'A'}, 'Cysteine': {'TGC, TGT', 'C', 'Cys'}, 'Aspartic acid': {'GAC GAT', 'Asp', 'D'}, 'Glutamic acid': {'GAA GAG', 'Glu', 'E'}, 'Phenylalanine': {'Phe', 'F', 'TTC TTT'}, 'Glycine': {'GGA GGC GGG GGT', 'Gly', 'G'}, 'Histidine': {'CAC CAT', 'His', 'H'}, 'Isoleucine': {'ATA ATC ATT', 'I', 'Ile'}, 'Lysine': {'Lys', 'AAA AAG', 'K'}, 'Leucine': {'L', 'Leu', 'TTA TTG CTA CTC CTG CTT'}, 'Methionine': {'MATG', 'Met'}, 'Asparagine': {'AAC AAT', 'Asn', 'N'}, 'Proline': {'Pro', 'P', 'CCA CCC CCG CCT'}, 'Glutamine': {'Gln', 'Q', 'CAA CAG'}, 'Arginine': {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}, 'Serine': {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}, 'Threonine': {'ACA ACC ACG ACU', 'T', 'Thr'}, 'Valine': {'GTA GTC GTG GTT', 'Val', 'V'}, 'Tryptophan': {'Trp', 'TGG', 'W'}, 'Tyrosine': {'Y,TAC TAT', 'Tyr'}}
del or .pop()¶Much like lists, we can remove entries from a dictionary directly with the del statement or one at a time using the .pop() method. These operate in much the same way as they do for lists.
# Delete the "Cysteine" entry
del aminoacids_dict["Cysteine"] # del(aminoacids_dict["Cysteine"]) is equivalent
aminoacids_dict
{'Alanine': {'Ala', 'GCA GCC GCG GCT', 'A'}, 'Aspartic acid': {'GAC GAT', 'Asp', 'D'}, 'Glutamic acid': {'GAA GAG', 'Glu', 'E'}, 'Phenylalanine': {'Phe', 'F', 'TTC TTT'}, 'Glycine': {'GGA GGC GGG GGT', 'Gly', 'G'}, 'Histidine': {'CAC CAT', 'His', 'H'}, 'Isoleucine': {'ATA ATC ATT', 'I', 'Ile'}, 'Lysine': {'Lys', 'AAA AAG', 'K'}, 'Leucine': {'L', 'Leu', 'TTA TTG CTA CTC CTG CTT'}, 'Methionine': {'MATG', 'Met'}, 'Asparagine': {'AAC AAT', 'Asn', 'N'}, 'Proline': {'Pro', 'P', 'CCA CCC CCG CCT'}, 'Glutamine': {'Gln', 'Q', 'CAA CAG'}, 'Arginine': {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}, 'Serine': {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}, 'Threonine': {'ACA ACC ACG ACU', 'T', 'Thr'}, 'Valine': {'GTA GTC GTG GTT', 'Val', 'V'}, 'Tryptophan': {'Trp', 'TGG', 'W'}, 'Tyrosine': {'Y,TAC TAT', 'Tyr'}}
There are a number of additional dictionary methods that can provide information about or alter a dictionary object. For example, len() gives us the number of key:value pairs in a dictionary.
Let's try out the len() function.
# How many entries are in our dictionary?
len(aminoacids_dict)
# Equivalent to this awkward code
aminoacids_dict.__len__()
19
19
We are missing STOP and START codons in our dictionary. Use Python code to demonstrate that STOP is not present in aminoacids_dict. Tip: Your answer should be a boolean.
"STOP" in aminoacids_dict
False
Use .get() to retrieve all values associated with glutamic acid. Tip: Use help() to find out more about the usage of .get(). If that is not enough, look it up on the internet.
help(dict.get)
aminoacids_dict.get("Glutamic acid")
Help on method_descriptor:
get(self, key, default=None, /)
Return the value for key if key is in the dictionary, else default.
{'GAA GAG', 'Glu', 'E'}
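The real value of .get() shows when a key is missing: unlike [key] indexing, it returns a default instead of raising a KeyError. A small sketch on a throwaway dictionary:

```python
codons = {"Glu": "GAA GAG", "Ala": "GCA GCC GCG GCT"}

# [key] raises KeyError for a missing key; .get() returns a default instead
print(codons.get("Glu"))                   # GAA GAG
print(codons.get("STOP"))                  # None (the built-in default)
print(codons.get("STOP", "not present"))   # not present
```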
.values() method to extract values only from the dictionary¶If you need to extract only the values from a dictionary, use the method .values(). This will return a dict_values object but it can also be cast as a list object.
# Get the values from our dictionary AND cast it to a list at the same time
list_values_only = list(aminoacids_dict.values())
list_values_only
[{'Ala', 'GCA GCC GCG GCT', 'A'}, {'GAC GAT', 'Asp', 'D'}, {'GAA GAG', 'Glu', 'E'}, {'Phe', 'F', 'TTC TTT'}, {'GGA GGC GGG GGT', 'Gly', 'G'}, {'CAC CAT', 'His', 'H'}, {'ATA ATC ATT', 'I', 'Ile'}, {'Lys', 'AAA AAG', 'K'}, {'L', 'Leu', 'TTA TTG CTA CTC CTG CTT'}, {'MATG', 'Met'}, {'AAC AAT', 'Asn', 'N'}, {'Pro', 'P', 'CCA CCC CCG CCT'}, {'Gln', 'Q', 'CAA CAG'}, {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}, {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}, {'ACA ACC ACG ACU', 'T', 'Thr'}, {'GTA GTC GTG GTT', 'Val', 'V'}, {'Trp', 'TGG', 'W'}, {'Y,TAC TAT', 'Tyr'}]
All of our examples so far have looked at the types of objects that can be used as values. To remind you, the keys of a dictionary must be of an immutable type such as an integer, string, boolean, or tuple (coming up!), but not a list or dictionary - remember, those are mutable.
# Add keys and values using update() or squared-bracket notation
dictionary_empty = {} #empty
# Use a string
dictionary_empty.update({"string" : 1})
# Use an integer
dictionary_empty[15] = "integer"
# Use a boolean
dictionary_empty.update({True: 3})
dictionary_empty
{'string': 1, 15: 'integer', True: 3}
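A quick sketch of what happens if you do try a mutable key: a list raises a TypeError ("unhashable type"), while an equivalent tuple is accepted.

```python
# Mutable objects cannot be dictionary keys; immutable ones can
try:
    {["GCA", "GCT"]: "Alanine"}         # list key -> TypeError: unhashable type
except TypeError as err:
    print("list key rejected:", err)

good = {("GCA", "GCT"): "Alanine"}      # a tuple key is hashable
print(good[("GCA", "GCT")])             # Alanine
```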
However, an element of a list can be used as a dictionary key, provided that the element itself is immutable.
list_1
# Use a list element as a key
dictionary_test = {list_1[0]: 2.5}
dictionary_test
[20, 30.5, True, 'biology', 1, 100, 10, 25]
{20: 2.5}
Just like lists, you can use dictionaries as values within your dictionaries, thus generating nested dictionaries. As with nested lists, you can access keys at each level by chaining the [key] syntax.
# Build a dictionary where values are also dictionaries!
aminoacids_subset_dict = {'Alanine': {'One letter symbol':'A',
'Three letter symbol':'Ala',
'codons':'GCA GCC GCG GCU',
'Number of codons':4},
'Aspartic acid': {'One letter symbol':'D',
'Three letter symbol':'Asp',
'codons':'GAC GAU',
'Number of codons':2},
'Glutamic acid': {'One letter symbol':'E',
'Three letter symbol':'Glu',
'codons':'GAA GAG',
'Number of codons':2}
}
# check it's type
type(aminoacids_subset_dict)
type(aminoacids_subset_dict["Alanine"])
# Pull out a few specific sub-entries
aminoacids_subset_dict["Alanine"]["Number of codons"]
aminoacids_subset_dict["Glutamic acid"]["codons"]
<class 'dict'>
<class 'dict'>
4
'GAA GAG'
[ ] notation¶Unlike lists, which have a numeric index, dictionary keys are not positionally indexed (although, since Python 3.7, dictionaries do preserve insertion order). Therefore you cannot access specific elements by their "position" within the dictionary, and consequently you cannot use the slice notation [start:end] with dictionaries either.
# Try to pull an element out by index position. What happens?
aminoacids_subset_dict[0]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-76-136785c30270> in <module> 1 # Try to pull an element out by index position. What happens? ----> 2 aminoacids_subset_dict[0] KeyError: 0
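If you genuinely need the n-th entry, one workaround (a sketch, relying on the insertion order guaranteed since Python 3.7) is to cast the keys to a list first:

```python
d = {"Alanine": "A", "Aspartic acid": "D", "Glutamic acid": "E"}

# Dictionaries have no positional index, but they do preserve
# insertion order, so a list of keys recovers "positions"
first_key = list(d)[0]
print(first_key)        # Alanine
print(d[first_key])     # A
```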
To summarize, lists and dictionaries are very similar but have three key differences: indexing, order, and mutability. If you need fast lookups of unique keys and their values, go with dictionaries.
Run dir(dict) for more methods on dictionaries.
We've already mentioned the concept of tuples but haven't clearly discussed what these objects are. The tuple is, simply put, an immutable list. Let's compare them to lists for a better understanding.
() to initialize.# When creating a one-element tuple, the comma after the element is required
# Otherwise the parentheses are treated as grouping and you get a string.
tuple_single = ("a")
type(tuple_single)
# Add the trailing comma to make a true one-element tuple
tuple_single = ("a",)
type(tuple_single)
tuple_single
<class 'str'>
<class 'tuple'>
('a',)
Time to make a tuple with multiple elements:
# Generate a tuple with multiple entries
tuple_aminoacids = ("Alanine", "ala", "A", "GCA GCC GCG GCU")
tuple_aminoacids
('Alanine', 'ala', 'A', 'GCA GCC GCG GCU')
If you try to change an element in a tuple, you get a traceback (Python error). Let's make Alanine all lowercase:
tuple_aminoacids[0]
tuple_aminoacids[0] = "alanine"
'Alanine'
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-82-7ec795c11b7c> in <module> 1 tuple_aminoacids[0] 2 ----> 3 tuple_aminoacids[0] = "alanine" TypeError: 'tuple' object does not support item assignment
tuple() function¶The immutability of tuples makes them good candidates for reliably storing data that you do not want changed (like by accident) - the same property that lets them serve as dictionary keys. Similarly to lists, the function tuple() will split a single string into its component characters and store them as elements:
tuple_alanine = tuple("Alanine")
print(tuple_alanine)
print(type(tuple_alanine))
len(tuple_alanine)
('A', 'l', 'a', 'n', 'i', 'n', 'e')
<class 'tuple'>
7
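The reverse direction also works: str.join() glues the characters of a tuple (or list) back into a single string. A minimal sketch:

```python
tuple_alanine = tuple("Alanine")   # ('A', 'l', 'a', 'n', 'i', 'n', 'e')

# str.join() reassembles an iterable of strings into one string
word = "".join(tuple_alanine)
print(word)        # Alanine
print(len(word))   # 7
```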
.count() and .index()¶As mentioned above, given the immutability of the tuple, it doesn't need many of the methods a list has. As a consequence, tuples have only two methods: .count() and .index()
dir(tuple_aminoacids)[-2] # Tuple methods
tuple_aminoacids
tuple_aminoacids.index("ala") # "ala" is at index position 1
tuple_aminoacids.count("GCA GCC GCG GCU") # There is one instance of this codon string
'count'
('Alanine', 'ala', 'A', 'GCA GCC GCG GCU')
1
1
.append() method¶Methods such as .append() do not operate on tuples because of their immutability. A workaround to append elements to a tuple is to first create a list, append elements to it, and then use type conversion to make a tuple.
list_aminoacids = [["Alanine", "Ala", "A", "GCA GCC GCG GCU"], # create a nested list
["Cysteine", "Cys", "C", "UGC, UGU"],
["Aspartic acid", "Asp", "D", "GAC GAU"],
["Glutamic acid", "Glu", "E", "GAA GAG"]
]
list_aminoacids
list_aminoacids.append(["Phenylalanine", "Phe", "F", "UUC UUU"]) # Append the desired elements to the list
print() # Use empty prints to make the output more readable
list_aminoacids
tuple_aminoacids = tuple(list_aminoacids) # Type convert the list into a tuple
print()
tuple_aminoacids # Note the output differences!
type(tuple_aminoacids)
[['Alanine', 'Ala', 'A', 'GCA GCC GCG GCU'], ['Cysteine', 'Cys', 'C', 'UGC, UGU'], ['Aspartic acid', 'Asp', 'D', 'GAC GAU'], ['Glutamic acid', 'Glu', 'E', 'GAA GAG']]
[['Alanine', 'Ala', 'A', 'GCA GCC GCG GCU'], ['Cysteine', 'Cys', 'C', 'UGC, UGU'], ['Aspartic acid', 'Asp', 'D', 'GAC GAU'], ['Glutamic acid', 'Glu', 'E', 'GAA GAG'], ['Phenylalanine', 'Phe', 'F', 'UUC UUU']]
(['Alanine', 'Ala', 'A', 'GCA GCC GCG GCU'], ['Cysteine', 'Cys', 'C', 'UGC, UGU'], ['Aspartic acid', 'Asp', 'D', 'GAC GAU'], ['Glutamic acid', 'Glu', 'E', 'GAA GAG'], ['Phenylalanine', 'Phe', 'F', 'UUC UUU'])
<class 'tuple'>
Accessing tuple elements works the same as with a list, and you can use the [] notation to assign elements to individual variables one by one.
var_1 = tuple_aminoacids[0]
var_2 = tuple_aminoacids[1]
print(var_1, var_2)
['Alanine', 'Ala', 'A', 'GCA GCC GCG GCU'] ['Cysteine', 'Cys', 'C', 'UGC, UGU']
You can also use multiple assignment (remember lecture 1?) to assign several variables at once. Just be sure that each side balances out! Note that you can also use slice notation on a tuple.
var_a, var_b, var_c, var_d, var_e, var_f = tuple_alanine[0:6]
# var_a, var_b, var_c, var_d, var_e, var_f = tuple_alanine
print(var_a, var_b, var_c, var_d)
A l a n
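When the two sides don't balance, Python also offers starred unpacking, which collects "the rest" into a list. A small sketch:

```python
tuple_aminoacid = ("Alanine", "Ala", "A", "GCA GCC GCG GCU")

# A starred target soaks up whatever the other names don't claim
name, *rest = tuple_aminoacid
print(name)   # Alanine
print(rest)   # ['Ala', 'A', 'GCA GCC GCG GCU']
```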
.items() method in dictionaries to make tuples¶Jumping back quickly into dictionaries, you may want to preserve or move around a key:value pair. You can do so using the .items() method which will return a dict_items object. From there, you can use tuple() to cast it over to a tuple object either individually, or as a tuple of tuples.
tuple(aminoacids_dict.items())[0]
tuple(aminoacids_dict.items())[-1]
tuple(aminoacids_dict.items())[4:]
('Alanine', {'Ala', 'GCA GCC GCG GCT', 'A'})
('Tyrosine', {'Y,TAC TAT', 'Tyr'})
(('Glycine', {'GGA GGC GGG GGT', 'Gly', 'G'}), ('Histidine', {'CAC CAT', 'His', 'H'}), ('Isoleucine', {'ATA ATC ATT', 'I', 'Ile'}), ('Lysine', {'Lys', 'AAA AAG', 'K'}), ('Leucine', {'L', 'Leu', 'TTA TTG CTA CTC CTG CTT'}), ('Methionine', {'MATG', 'Met'}), ('Asparagine', {'AAC AAT', 'Asn', 'N'}), ('Proline', {'Pro', 'P', 'CCA CCC CCG CCT'}), ('Glutamine', {'Gln', 'Q', 'CAA CAG'}), ('Arginine', {'Arg', 'AGA AGG CGA CGC CGG CGT', 'R'}), ('Serine', {'S', 'Ser', 'AGC AGT TCA TCC TCG TCT'}), ('Threonine', {'ACA ACC ACG ACU', 'T', 'Thr'}), ('Valine', {'GTA GTC GTG GTT', 'Val', 'V'}), ('Tryptophan', {'Trp', 'TGG', 'W'}), ('Tyrosine', {'Y,TAC TAT', 'Tyr'}))
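.items() also pairs naturally with tuple unpacking in a for loop - a preview of iteration, sketched on a small throwaway dictionary:

```python
symbols = {"Alanine": "A", "Cysteine": "C", "Glycine": "G"}

# Each iteration yields one (key, value) tuple, unpacked on the fly
pairs = []
for name, letter in symbols.items():
    print(name, "->", letter)
    pairs.append((name, letter))
```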
Built-in Python objects are very versatile: they can hold different data types, support some basic mathematical operations, can be mutable or not, and more. However, they lack a key feature: performing simple and advanced mathematical operations on whole data structures in ways that are time- and memory-efficient.
NumPy (Numeric Python) is a package for scientific computing developed by Travis Oliphant and first released in 2005, based on a pre-existing Python package called Numeric. NumPy has advanced and changed a lot since its first release, thanks to a very active community of programmers who have contributed their time and effort to improving it.
In terms of data structures, NumPy offers an alternative to built-in lists called NumPy arrays, which allow mathematical operations across one- or multi-dimensional arrays. Let's start diving into NumPy's functionality. First, though, we need to install NumPy and import it as np (not required but it is the standard).
A one-dimensional (1D) array object has all data as a single row. Sounds like a list, right? These structures, however, can only contain a single data type and will perform coercion without any warnings. Let's create our first NumPy arrays.
# !pip3 install numpy # Always keep installation commands commented out to prevent unwanted installations
import numpy as np # Already done in section 0.5.0
# All integers
np.array([5, 16, 23])
# Coerce to float
np.array([1.2, 4, 12.12])
# Coerce to Unicode string
np.array([5, 1.2, "GCU"])
array([ 5, 16, 23])
array([ 1.2 , 4. , 12.12])
array(['5', '1.2', 'GCU'], dtype='<U32')
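You can confirm the coercion by checking the array's .dtype attribute. A quick sketch (note that the exact integer width is platform-dependent):

```python
import numpy as np

# .dtype reveals the single type every element was coerced to
ints = np.array([5, 16, 23])
floats = np.array([1.2, 4, 12.12])
strings = np.array([5, 1.2, "GCU"])

print(ints.dtype.kind)     # 'i' (integer; exact width varies by platform)
print(floats.dtype)        # float64
print(strings.dtype.kind)  # 'U' (Unicode string)
```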
.shape property to retrieve array dimensions¶If you're not sure how many elements, columns, rows etc., you have in your array, you can quickly retrieve it using the .shape property. We'll talk more about what a property is in section 5.2.0.
array_1 = np.array([5, 16, 23])
array_1.shape
# (3,) means 3 elements along a single axis (remember this is a 1D array)
(3,)
Remember that operators like + and * on lists produce behaviours related to concatenation and repetition of the list objects themselves. With arrays, we can instead perform element-wise math operations, provided the arrays have compatible (broadcastable) sizes and the elements are suitable for math operations.
# Multiplication of a scalar to array
np.array([5, 16, 23]) * 2
# Multiplication of two size-matched arrays
np.array([5, 16, 23]) * np.array([3, 19, 15])
np.array([5, 16, 23]) / np.array([2])
# What happens when the sizes don't match?
np.array([5, 16, 23, 30]) * np.array([3, 2])
array([10, 32, 46])
array([ 15, 304, 345])
array([ 2.5, 8. , 11.5])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-94-fda5791c469c> in <module> 8 9 # What happens when the sizes don't match? ---> 10 np.array([5, 16, 23, 30]) * np.array([3, 2]) ValueError: operands could not be broadcast together with shapes (4,) (2,)
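The error message mentions "broadcast": NumPy can stretch size-1 dimensions to match, which is why dividing by np.array([2]) worked above. A sketch of making mismatched shapes compatible by reshaping:

```python
import numpy as np

a = np.array([5, 16, 23, 30])   # shape (4,)
b = np.array([3, 2])            # shape (2,) -- incompatible with (4,)

# Reshaping a into a (4, 1) column lets NumPy broadcast b across it,
# producing a (4, 2) result: every element of a times every element of b
result = a.reshape(4, 1) * b
print(result)
print(result.shape)   # (4, 2)
```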
If we add up two NumPy arrays, their elements will be added element-wise. The same rules as above apply.
# Remember what + does for a list?
list([5, 16, 23]) + list([3, 19, 15])
# Compare to an array
np.array([5, 16, 23]) + np.array([3, 19, 15])
np.array([5, 16, 23]) + np.array([3])
# You can subtract too!
np.array([5, 16, 23]) - np.array([13])
# What happens if the arrays are not matched in size?
np.array([5, 16, 23]) + np.array([3, 19, 15, 16])
[5, 16, 23, 3, 19, 15]
array([ 8, 35, 38])
array([ 8, 19, 26])
array([-8, 3, 10])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-95-85c87b034762> in <module> 11 12 # What happens if the arrays are not matched in size? ---> 13 np.array([5, 16, 23]) + np.array([3, 19, 15, 16]) ValueError: operands could not be broadcast together with shapes (3,) (4,)
Adding a list and a NumPy array results in the list being coerced to an array to complete the operation. Remember that the addition has to make sense to the interpreter too; the list is simply coerced to an array object. Will its contents be coerced as needed?
# Coercion happens regardless of addition order
list([5, 16, 23]) + np.array([3, 19, 15])
np.array([3, 19, 15]) + list([5, 16, 23])
# Will this work?
np.array([3, 19, 15]) + list(["5", "16", "23"])
# Can we concatenate string arrays like this?
np.array(["this", "that", "those"]) + np.array(["who", "what", "where"])
Just like lists, tuples, and dictionaries, NumPy arrays are data structures with their own properties and methods. As you've seen, we initialize arrays with the np.array() function. Similar to lists, we can use the [] notation to subset and slice them with the expected behaviour.
array_1 = np.array([3, 15, 21])
array_1[...]
array_1[...]
array_1[...]
array_1[...]
array_1[...]
Up until now we haven't really used conditional operators but we can quickly ask certain objects about their elements and if they fulfill a condition with a True or False result. With NumPy arrays, we can perform these conditional statements in an element-wise manner.
# Which elements in array_1 are larger than 17?
array_1 ...
Instead of just looking at an array of booleans, which can get quite long in large data sets, you can instead feed your conditional result back into the original array to retrieve the values that return true. This effectively gives you an array of actual element values that you might "want" to work with. We achieve this with the syntax array[conditional statement] in the above case array_1 > 17 is our conditional statement. Let's try!
# Return the values greater than 17
array_1[array_1 ...]
array_1[array_1 ...]
~¶In Python there are a number of ways to negate booleans (create the opposite value). With the NumPy package and arrays in particular we can use the ~ operator to perform a bitwise negation on the array. Like with all objects, however, if used outside this context the behaviour of the operator may not be as expected!
# Practice our negation on a toy array
... np.array([True, False, False, True, True])
# Return the values <= 17
array_1[...]
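As a worked sketch on a separate toy array (so the blanks above stay yours to fill):

```python
import numpy as np

toy = np.array([2, 18, 9, 25])
mask = toy > 10                 # element-wise comparison -> boolean array

print(mask)          # [False  True False  True]
print(toy[mask])     # values where the mask is True: [18 25]
print(toy[~mask])    # ~ flips the booleans: [ 2  9]
```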
Put simply, a 2D NumPy array has rows and columns (hence the 2D name). Similar to their 1D counterparts, all data in a 2D array must be of the same type. We can make 2D arrays by combining two individual lists of the same size.
# The double squared bracket is required, otherwise you get a "type not understood" traceback
array_2 = np.array([[5, 16, 23], # The line break enhances readability but is not required
[3, 19, 15]])
array_2
or by type-converting a nested list
# Make a short 2D array of amino acid information
aminoacids_list = [["Alanine", "Ala", "A"],
["Arginine", "Arg", "R"],
["Asparagine", "Asn", "N"],
["Aspartic acid", "Asp", "D"]]
# Assign our list to an array
aminoacids_array = np.array(...)
# Let's take a look at it
aminoacids_array
# What are its dimensions?
aminoacids_array...
[row][column] notation¶To subset 2D arrays we use two sets of squared brackets: one for rows and one for columns which looks like my_array[row][column].
Let's select the element "15" from array_2.
array_2
array_2 [...][...] # What is the problem? What does "index out of bounds" mean?
# Array has no row at index 2
array_2[...][2]
[row, column] notation¶The same result as above can be achieved using one set of squared brackets with a comma that separates rows from columns, which looks like my_array[row, column].
array_2[...]
That's right! Slicing has been implemented for arrays so you can pull out proper subsets unlike nested lists! For example, to select all rows or columns, use a : (colon) at the respective side of the comma. Let's practice some slicing.
array_2[...]
array_2[1, 0:2]
array_2[0:2, 0:2]
array_2[...] # Note: basic slicing returns a view of the array, not a copy
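One subtlety worth knowing (a sketch, independent of the blanks above): basic slices of a NumPy array are views, so writing through the slice changes the original; call .copy() when you want an independent array.

```python
import numpy as np

arr = np.array([[5, 16, 23],
                [3, 19, 15]])

view = arr[0, :]         # basic slicing returns a view, not a copy
view[0] = 99             # ...so this writes into arr as well
print(arr[0, 0])         # 99

safe = arr[1, :].copy()  # an explicit .copy() is independent
safe[0] = -1
print(arr[1, 0])         # still 3
```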
So far we have been doing very basic operations to experiment with Python objects. For that reason, we have been using small datasets. In reality, we get big datasets that need to be analyzed in ways that are efficient and reliable.
In order to explore NumPy's statistical capabilities, we are going to simulate a more complex, very popular dataset in the data science world called iris. We will showcase some exploratory data analysis capabilities but we encourage you to look into R to do advanced stats and data visualization (CAGEF also has an introductory R course!).
Let's create the iris dataset
# loc= mean of the distribution, scale=standard deviation, size= how many elements
sepal_length = np.round(np.random.normal(loc = 5, scale=0.2, size=14), 2)
sepal_width = np.round(np.random.normal(loc = 3, scale=0.5, size=14), 2)
petal_length = np.round(np.random.normal(loc = 1.3, scale=0.2, size=14), 2)
petal_width = np.round(np.random.normal(loc = 0.15, scale=0.2, size=14), 2)
array_iris = np.column_stack([sepal_length.astype(float),
sepal_width.astype(float),
petal_length.astype(float),
petal_width.astype(float)
])
type(array_iris)
array_iris
mean() function on arrays¶Remember that both arrays and many of the statistical functions we'll be introducing are part of the NumPy package. That means there are implicit expectations about the behaviours and attributes of objects like arrays. Without getting too deep into the philosophy, this design makes adding new functions to a package straightforward while keeping their behaviour consistent.
Let's take a closer look at the mean() function. Looking at the documentation we see the following information
numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>) which we can break down to:
a: an array containing numbers - or something that can be converted to one.
axis: defaults to None, which takes the mean of the flattened array, but can also be set to 0, 1, or n (for multi-dimensional arrays).
where: optional; can be used to supply a conditional array that determines which elements are included in the calculation.
Let's start with the basic use of mean() and build up from there. First, we can calculate the average for sepal length (column 1)
# Calculate mean across the entire array
np.mean(array_iris)
# Calculate the mean of our array but only the 1st column. Is that index 1?
np.mean(array_iris[...])
# Calculate the mean of row 2
np.mean(array_iris[...])
# Equivalent notation
np.mean(array_iris[...])
axis parameter to calculate the mean across dimensions¶As you can see from above, when we use the default behaviour of mean, it treats an array, regardless of dimension, like a flat list of numbers and takes the mean of the whole set. What if we want to calculate the mean across rows or columns? Intuitively it can be a little confusing, but recall that we use [row, column, etc] notation to access arrays. If we are working with an array with n rows and m columns, then
axis = 0 will return an array of 1 x m length, calculating the mean going down each column.
axis = 1 will return an array of n x 1 length, calculating the mean across each row.
Another way to think about it is that axis=0 returns a row of means, and axis=1 returns a column of means.
Note that we aren't even talking about multi-dimensional arrays where axis can also be assigned as a tuple to identify different dimensions you'd like to perform the calculation across.
# Calculate the mean of each column
np.mean(array_iris, ...)
# Calculate the mean across each row
np.mean(array_iris, ...)
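A worked sketch on a tiny 2 x 3 array, where the answers are easy to check by hand:

```python
import numpy as np

tiny = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(np.mean(tiny))            # 3.5  (all six values, flattened)
print(np.mean(tiny, axis=0))    # [2.5 3.5 4.5]  (down each column)
print(np.mean(tiny, axis=1))    # [2. 5.]        (across each row)
```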
We've been looking at just the mean() function but we can achieve different calculations with similar behaviours from
median()
corrcoef()
std()
sum()
# Calculate the median of column 2
np.median(array_iris[ ...])
# Calculate the correlation coefficient between sepal length (column 1) and sepal width (column 2)
np.corrcoef(array_iris[ : , 0], ...)
# What is the standard deviation of column 1?
np.std(array_iris[ : , 0])
# What is the sum of column 4?
np.sum(array_iris[ : , 3])
transpose() with arrays in NumPy¶Sometimes you'd like to manipulate the shape or order of elements in your arrays. A common method you may wish to use is transpose(). This method is available through the array object, and it takes care of all of the details even for multi-dimensional arrays. It's actually part of NumPy's array manipulation set of routines.
For now we'll stick with the straightforward 2-dimensional array.
# Transpose array_iris
array_iris... # Why is it not working?
# Look at the array again
array_iris
# Transpose the array
array_iris...
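A sketch on a separate toy array (leaving the blanks above to you): transposing swaps rows and columns, and the .T shorthand does the same thing.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)

print(m.transpose())           # shape (3, 2): rows become columns
print(m.T)                     # .T is the shorthand property
print(m.T.shape)               # (3, 2)
```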
You may remember from first- or second-year algebra the following formula
$$ A \times A^{-1} = I $$Where $A$ is a square matrix, we can find its inverse $A^{-1}$ such that multiplying them produces the identity matrix, which has all 0s except for a diagonal line of 1s.
To calculate the inverse matrix, we can use the function inv() from the linalg module of NumPy.
# Try to calculate the inverse of array_2
...
# to get the inverse, the array must be square (same number of rows and columns)
np.linalg.inv([[5, 16, 23],
[3, 19, 15],
[17, 2, 5]
...
Now what??? Why the "SyntaxError: unexpected EOF while parsing"? What is "EOF"?
EOF stands for "end of file", and the error came up because we missed a closing parenthesis at the end of the code.
# Fix up our call to inv()
np.linalg.inv([[5, 16, 23],
[3, 19, 15],
[17, 2, 5]
...
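You can verify the formula numerically: multiplying a square matrix by its inverse should reproduce the identity matrix, up to floating-point rounding (hence np.allclose). A sketch on a small 2 x 2 matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)

# A @ A^-1 should equal the identity, within floating-point tolerance
print(np.allclose(A @ A_inv, np.eye(2)))   # True
```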
dot() function¶The dot-product of two matrices calculates the sum of the product of elements between rows of matrix A and columns of matrix B. Sound familiar?
https://algebra1course.wordpress.com/2013/02/19/3-matrix-operations-dot-products-and-inverses/
# Simple case of dot-product
np.dot(np.array([[5, 16, 23]]),
np.array([[1], [2], [3]])) # Notice how we are making a "vertical" array?
# It can all be done in just one line but can be harder to read
# Calculate the dot-product a 3x3 matrix to a 1x3
np.dot(np.array([[ 5, 16, 23], [ 3, 19, 15], [17, 2, 5]]),
array_1)
DataFrame object¶To put it in context, Pandas expands NumPy capabilities in the same way that NumPy expands Python's. Pandas is a data manipulation tool developed by Wes McKinney, built on NumPy to simplify working with tabular datasets.
We'll cover this in more detail next week, but in properly formatted tabular datasets, each column is a variable (a parameter that was measured) and each row is a set of observations (the results of quantitatively or qualitatively measuring each parameter).
There are two data structures in Pandas that we are interested in:
| Structure | Description | Characteristics |
|---|---|---|
| Series | A 1-dimensional array-like structure | Contains a single data type; values are mutable but size is not |
| DataFrame | A 2D labeled, tabular container for Series objects | Resembles a spreadsheet; size-mutable by adding columns |
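The two structures side by side, as a minimal sketch (assuming pandas is installed):

```python
import pandas as pd

# A Series: 1D, one dtype, with a labelled index
symbols = pd.Series(["A", "R", "N"], index=["Ala", "Arg", "Asn"])
print(symbols["Arg"])          # R

# A DataFrame: a 2D, spreadsheet-like container of Series columns
df = pd.DataFrame({"symbol": ["A", "R", "N"],
                   "codons": [4, 6, 2]},
                  index=["Ala", "Arg", "Asn"])
print(df.shape)                # (3, 2)
```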
Here is some information about Pandas library architecture:
pandas.core: the data structures of the Pandas library
pandas.src: basic low-level functionality of Pandas
pandas.io: tools to input and output files, data, etc.
pandas.tools: functional operations in Pandas
Other modules include sparse (missing values), stats (statistical applications), util (testing and debugging tools), and rpy (rpy2, connectivity with the R programming language).
First, install pandas using pip
# !pip install pandas
import pandas as ...
DataFrame from a dictionary object¶We can start our journey into Pandas by creating data frames out of dictionaries! Recall that these use a key:value structure. We can use keys as variables (columns) and values as observations (rows).
First, let's build a dictionary
# Build a dictionary of amino acids with 4 keys: Amino acid, single letter symbol, DNA codons, and Compressed
aminoacids_dict = {
"Aminoacid": ["Alanine","Arginine","Asparagine","Aspartic acid","Cysteine","Glutamine",
"Glutamic acid","Glycine","Histidine","Isoleucine","Leucine","Lysine",
"Methionine","Phenylalanine","Proline","Pyrrolysine","Serine","Threonine",
"Tryptophan","Tyrosine","Valine","START","STOP"],
"single letter symbol": ["A","R","N","D","C","Q","E","G","H","I","L","K","M",
"F","P","O","S","T","W","Y","V","START","STOP"],
"DNA codons": ["GCT GCC GCA GCG","CGT CGC CGA CGG AGA AGG","AAT AAC","GAT GAC","TGT TGC","CAA CAG CAR",
"GAA GAG","GGT GGC GGA GGG","CAT CAC","ATT ATC ATA","TTA TTG CTT CTC CTA CTG","AAA AAG",
"ATG","TTT TTC","CCT CCC CCA CCG","TAG","TCT TCC TCA TCG AGT AGC","ACT ACC ACA ACG ACN",
"TGG","TAT TAC","GTT GTC GTA GTG","ATG","TAA TGA TAG"],
"Compressed": ["GCN","CGN AGR MGN CGY","AAY","GAY","TGY","CAR","GAR","GGN","CAY","ATH","YTR CTY CTN TTR",
"AAR","ATG","TTY","CCN","TAG","TCN AGY","ACN","TGG","TAY","GTN", "ATG", "TRA TAG"]
}
aminoacids_dict
DataFrame() function¶Hard to read, right? It will look much better to human eyes if we convert it into a DataFrame object. We'll use the function DataFrame() to accomplish our goal and we'll also introduce a way to take a quick look at your data with the .head() method. This will allow us to look at a specified number of rows from the beginning of our DataFrame. In this case, the default number of rows is 5.
# Convert our dictionary to DataFrame
aminoacids_df = ...
# Take a peek at the first 5 rows of our DataFrame with the head() method
aminoacids_df...
# We'll just insert an extra line here for ease of readability
print()
type(aminoacids_df)
Notice how Jupyter has formatted the DataFrame for us into a nice readable table? Convenient!
Recall that objects can be composed of attributes (values) and methods. Up until now we have been collecting information or working with objects through their methods. Depending on who implemented the code for your object and the language you are using, most attributes are (by good practice) private. That means that, under the hood, you can't simply alter these attributes directly; instead, you call helper methods to do so. Python has a specific implementation whereby, if you are allowed to access or alter an attribute, it will likely be a property object. Defining attributes this way lets the author specify exactly how information about them is got and set (with failsafes!).
More importantly, all of the details are hidden from the user and we can simply access these properties with the .property syntax. While it may look like we are directly accessing an attribute, we are not.
.shape property to retrieve dimension information¶How many rows and columns do we have? That question can be answered by retrieving the .shape property which will return a pair of values in the format of (row, column).
# Check the shape of our dataframe
aminoacids_df...
.index property to retrieve/add/change row names in your DataFrame¶From above you may notice a single unnamed column in our DataFrame. It looks like it is using the potential indices of our rows as labels. These are the row names of our data frame, and they can be altered to carry meaningful information, like amino acid abbreviations, instead.
To alter our indices we can access the property .index and assign it directly to a new list. Now, let's add indices (row names) to our data frame.
# Retrieve the indices of our data frame
aminoacids_df...
# Alter the indices of our data frame
aminoacids_df... = ["Ala","Arg","Asn","Asp","Cys","Gln","Glu","Gly","His","Ile",
                    "Leu","Lys","Met","Phe","Pro","Pyl","Ser","Thr","Trp","Tyr",
                    "Val","START","STOP"]
aminoacids_df
.columns property to retrieve/change column names in your DataFrame¶Much like the .index property, we can use the .columns property to get and set our column names.
# Retrieve the column information for our data frame
aminoacids_df...
DataFrames, import them with read_csv()!¶Now we have a Python object that resembles a spreadsheet. However, creating data frames this way is tedious and very error-prone. Small mistakes can creep in - and what happens if you have hundreds of thousands of rows of data?
The more sensible way to produce such large object is by reading in files using functions like Pandas' read_csv(). CSV stands for "comma-separated values" and is a type of text file like the TSV (tab-separated values) and other delimiter-separated values. Pandas will automatically store all of this information as a DataFrame object during import.
Let's import our first data file in Python using the aminoacids.csv file in this lecture's data directory.
# Import aminoacids.csv with read_csv
pd.read_csv(...)
# There is an issue with column 0.
Column 0 is not automatically recognized as the indices, so we need to state it explicitly using the right parameter. If you use help(pd.read_csv) you'll see that there are many parameters that can be assigned during our import call.
Help on function read_csv in module pandas.io.parsers:
read_csv(filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap], sep=<object object at 0x0000018D08E26E50>, delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options: Union[Dict[str, Any], NoneType] = None)
Read a comma-separated values (csv) file into DataFrame.
For our purpose we want to use the index_col parameter, whose default value is None, to let the function know that there is an index column located in column 0.
# Import our file again with the correct index column
aminoacids_csv_df = pd.read_csv("data/aminoacids.csv", ...)
aminoacids_csv_df.head()
type(aminoacids_csv_df)
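To see the effect of index_col without touching the data directory, here is a sketch that reads a hypothetical in-memory CSV (via io.StringIO) standing in for data/aminoacids.csv:

```python
import io
import pandas as pd

# A tiny hypothetical CSV with an unnamed first column of row labels
csv_text = ",Aminoacid,Mass\nAla,Alanine,89.1\nArg,Arginine,174.2\n"

# Without index_col, the labels land in a column named "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells read_csv that column 0 holds the row labels
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

df_indexed.index.tolist()  # ['Ala', 'Arg']
```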
DataFrame methods¶So far we've seen some helpful properties for retrieving and altering attributes within our DataFrame objects. We've also seen the use of .head() to look at the first n rows of our DataFrames. Similarly, you can view the last n rows of your DataFrame with .tail(), and you can retrieve an overall summary of the DataFrame with the .info() method.
# Get the tail end of our data.
# "n = 7" overwrites the default value of 5 rows
aminoacids_df...
# Retrieve the information about our data frame
aminoacids_csv_df.info()
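A quick sketch of all three methods on a hypothetical 10-row DataFrame:

```python
import pandas as pd

toy_df = pd.DataFrame({"n": range(10)})

toy_df.head()     # the first 5 rows (the default n)
toy_df.tail(n=7)  # the last 7 rows; n=7 overrides the default of 5
toy_df.info()     # prints dtypes, non-null counts, and memory usage
```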
DataFrame and Series objects with the [ ] notation.¶Pandas objects can also be subset with the [] syntax, but in a limited fashion that depends on the object:

Series: [label] returns a scalar (single-element) value matching the label
DataFrame: [colName] returns an entire column

We can pull data out using this notation in a number of ways, like using the column names directly. Let's retrieve the column Aminoacid and see what kind of object is returned.
# Take a look at just the first few entries of the Aminoacid column
aminoacids_csv_df[...].head()
print()
type(aminoacids_csv_df["Aminoacid"])
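A minimal sketch of the two behaviours side by side, on hypothetical toy objects:

```python
import pandas as pd

# Subsetting a Series with a label returns a scalar
s = pd.Series([10, 20], index=["a", "b"])
first = s["a"]  # 10

# Subsetting a DataFrame with a column name returns a whole column (a Series)
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
col = df["col1"]
```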
DataFrame by column returns a Series¶A Pandas Series is a 1D labeled NumPy array; Series make up the rows and columns of data frames. Though they inherit much of their structure from NumPy arrays, Pandas Series objects have their own attributes, such as .values, which lets you access the data contained in a Series as a NumPy ndarray object.
codons_series = aminoacids_df[...]
type(codons_series)
codons_series.head() # Notice that the indices are still inherited from the data frame
# Access elements with the .values property
codons_series.values[...]
type(codons_series.values[1:5])
# Directly access with an index
codons_series[1:5]
type(codons_series[1:5])
DataFrame by providing a list¶The values in a Series, as well as data frame rows and columns, are 1D NumPy arrays. By default, accessing a single column with [] returns a Series object.
If, however, we want to access a sub-portion of a DataFrame, we can also provide a list. Here's where the notation can get funny, because we define a list using [] as well! Therefore, to subset multiple columns we need syntax that looks like dataFrame[["colName1", "colName2", "colNameN"]].
To top it all off, providing a list to subset your DataFrame will always return a DataFrame object. You can perform similar operations on a Series object as well!
# Grab two columns from our data frame and look at just the beginning
aminoacids_df...
type(aminoacids_df[["Aminoacid", "DNA codons"]])
# What kind of object will be returned when asking for a single column?
type(aminoacids_df[["Aminoacid"]])
# Directly access a Series with single or multiple labels too!
codons_series[...]
type(codons_series[["Arg", "Asn"]])
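The [] versus [[]] distinction can be sketched on a hypothetical three-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

single = df["a"]           # a single label returns a Series
one_col = df[["a"]]        # a one-element list still returns a DataFrame
two_cols = df[["a", "c"]]  # a longer list returns a multi-column DataFrame
```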
.loc and .iloc advanced data access methods for rows and columns¶As you can see above, the standard [] notation can grant us access to parts of a DataFrame or Series, but it isn't particularly optimized. The Pandas package, however, provides optimized access through the .loc and .iloc methods, which behave similarly to the [] notation, so remember the difference between [] and [[]]!

| Method | Description | Examples to put within [] |
|---|---|---|
| .loc | Used primarily for accessing using labels and will search for matches in this attribute | ['a', 'b', 'c'] or 'a':'f' |
| .iloc | Used primarily for accessing with integer positions | [4, 3, 0] or 1:7 |

Note that both of these methods also accept an array of booleans, where NA is treated as False. So you have options, but be sure to choose the correct one!
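One detail worth a sketch (toy DataFrame, not the amino acid data): .loc slices by label and includes the endpoint, while .iloc slices by position and excludes it, just like ordinary Python slicing.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c"]  # label slice INCLUDES the endpoint: rows a, b, c
by_pos = df.iloc[0:2]       # integer slice excludes it: rows a, b
masked = df.loc[[True, False, True, False]]  # a boolean array works too
```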
aminoacids_csv_df...
aminoacids_csv_df... # Returns a Pandas Series broken into several lines
aminoacids_csv_df... # Returns a Pandas DataFrame
# Subset rows and columns
aminoacids_csv_df.loc[["Ala", "Asn"], "Aminoacid"] # returns a series
aminoacids_csv_df.iloc[...] # returns a DataFrame. Notice the extra brackets!
# Subset multiple rows and columns
aminoacids_csv_df.loc[["Ala", "Asn"], ["Aminoacid", "DNA codons"]]
# Chained indexing gives us the same result but beware!
aminoacids_csv_df.loc[["Ala", "Asn"]][...]
Okay, it appears that we can use [] to do almost anything, but beware! We've been playing around a lot by pulling out sub-portions of our data frame. There is a lot happening under the hood, but remember that we are dealing with objects! Depending on how we ask a Pandas object for access to its data, it may return a view of the object or a copy!
In the above example we used .loc[row, col] notation to access our DataFrame, but we also used .loc[row][col]. We've seen this before when accessing arrays: the ability to subset in two ways.
In the first case, we are multi-indexing by calling on a method and passing two parameters, row and column, to the method. For Python and the DataFrame object, it all happens in a single step by essentially going to that reference and pulling out what we want using the DataFrame's internal methods.
In the second case, we are chain-indexing. Depending on a package's implementation this can give you very different results! In the case of a DataFrame, we are asking Python to first retrieve just the rows of the DataFrame we want. When that object is returned it needs to be assigned a temporary place in Python memory. At this point, it could become a separate object or entity from the original - a shallow copy! After that, we are then subsetting by col in a completely different command (by Python's viewpoint). If we were using these commands to set values in our DataFrame we could be setting them in an object that simply disappears!
Simply put, unless you have good reason to, with Pandas objects such as a DataFrame, avoid chain-index notation!
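Here is a minimal sketch of why chained indexing is dangerous, using a hypothetical toy DataFrame. The single-step .loc assignment reaches the original object; the chained version assigns into a temporary copy that is then discarded (the warnings call is only there to silence the SettingWithCopyWarning for the demo):

```python
import warnings
import pandas as pd

warnings.simplefilter("ignore")  # hide the SettingWithCopyWarning in this demo

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Multi-indexing in one step: the assignment reaches the original DataFrame
df.loc[0, "b"] = 99

# Chained indexing: df[df["a"] > 1] returns a temporary copy first,
# so this assignment can be silently lost
df[df["a"] > 1]["b"] = 0

df["b"].tolist()  # the chained assignment never touched df
```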
# Which one is the copy?
id(aminoacids_csv_df)
# Direct access
id(aminoacids_csv_df.loc[:, :])
# Retrieve a copy
id(aminoacids_csv_df.loc[:][:])
DataFrame objects by their attributes¶That's right: although there are some limitations on just how well this can work, you can treat the columns of your DataFrame like accessible properties. You can do the same with a Series object.
You'll run into problems and errors if a column name is a reserved keyword, clashes with other Pandas package-specific names, or breaks valid variable-naming conventions, but other than that, it is possible to use .column to access by non-numeric labels.
Hint: Use tab-completion to see attributes of an object!
# Access a column from aminoacids_csv_df
aminoacids_csv_df...
Use the .iloc method to replicate the output of:
aminoacids_csv_df.loc[["Ala", "Ser", "Pyl"]]
aminoacids_csv_df.loc[["Ala", "Ser", "Pyl"], ["Aminoacid", "DNA codons"]]
aminoacids_csv_df.loc[:, ["Aminoacid", "DNA codons"]]
Data frames also support broadcasting: the simultaneous transmission of the same message to multiple recipients. This general definition of broadcasting is, at least at this point, more informative than its technical computer-science definition. Broadcasting is a convenient and efficient way to create and populate (add data to) columns and rows. Lists and arrays also support broadcasting, which we've already seen in a way when adding or multiplying across np.array objects.
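A minimal sketch of broadcasting on a hypothetical toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# A single scalar is broadcast down every row of the new column
df["flag"] = 0

# Arithmetic is broadcast elementwise across an existing column
df["x10"] = df["x"] * 10
```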
Here is an example of broadcasting NaNs across a DataFrame. We'll also introduce the id() function which gives us the unique integer identification of an object.
# Help me to break down this line to understand what it means
aminoacids_NA_df = aminoacids_csv_df.copy()
# Are these the same object or different?
id(aminoacids_NA_df)
id(aminoacids_csv_df)
aminoacids_NA_df.loc[:,:] = ...
aminoacids_NA_df
Let's broadcast an entire new column onto a copy of aminoacids_csv_df.
# Every observation in the new column will be the string "NA"
aminoacids_NA_df = aminoacids_csv_df.copy()
aminoacids_NA_df[...] = "NA"
aminoacids_NA_df.head()
A common occurrence during data collection or generation is that some values could not be recorded, and this can happen for a variety of reasons: The equipment malfunctioned, some entries were deleted by mistake, or data was simply not available for patient A on a given day. These events lead to data gaps, and it is a critical issue that needs to be properly handled in order to get reliable insights about your datasets. Missing values are represented by NA (not available) and/or NaN (not a number).
Let's add some random missing values using the .reindex() method, which essentially creates a copy of your DataFrame object conformed to a new index, inserting rows of NaN values for indices that did not previously exist.
df = pd.DataFrame(np.random.randn(5, 3),
index=['a', 'c', 'e', 'f','h'],
columns=['one', 'two', 'three']
)
NaN_df = ...(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Can you explain how NaNs were added? We do not have any explicit commands to add NaNs
NaN_df
# Of course, you can specify your own fill value
df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], fill_value = ...)
Do we have any missing values?
NaN_df["one"]...
Now, let's ask the opposite question: What observations are NaN?
NaN_df["one"]...
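These two questions are usually answered with the .isnull() and .notnull() methods; here is a sketch on a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

missing = s.isnull()       # True wherever a value is NaN
present = s.notnull()      # the complement of .isnull()
n_missing = missing.sum()  # booleans sum as 0/1, giving a count of NaNs
```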
In the case of summations, NA/NaN values are skipped, so they effectively contribute 0 (zero); in means, they are excluded from both the numerator and the denominator. If all the observations for a variable are NA/NaN, the mean is NaN.
# Compare the original DataFrame to our NA-filled one
df["two"].sum()
NaN_df["two"].sum()
np.mean(df["two"])
np.mean(NaN_df["two"])
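The skipping behaviour can be sketched on a hypothetical Series with one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()     # 4.0: the NaN contributes nothing
average = s.mean()  # 2.0: the NaN is excluded from the denominator as well
```

Note that the mean is (1 + 3) / 2, not (1 + 3) / 3: the missing value is dropped entirely rather than being treated as a zero.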
That's our second class on Python! You've made it through and we've learned about a lot of data structures:
| Structure | Package | Description | Initializer(s) | Indexed? | Mutable? | Nestable? |
|---|---|---|---|---|---|---|
| List | Python core | A 1D container of elements that can be made of anything | list() or [value1, value2, ...] | Yes | Yes | Yes |
| Tuple | Python core | A 1D container of elements that can be made of anything | tuple() or (value1, value2, ...) | Yes | No | Yes |
| Dictionary | Python core | A container of key:value pairs where keys must be immutable | dict() or {key1: value1, key2: value2, ...} | No | keys = No; values = Yes | Yes |
| Array | NumPy | A multi-dimensional container of a single, fixed data type | np.array() | Yes | Yes | NA |
| Series | Pandas | A 1D container of a single, fixed data type | pd.Series() | Yes | size = No; values = Yes | No |
| DataFrame | Pandas | A 2D container of Series objects (columns) | pd.DataFrame(dict) | Yes | num rows = No*; num cols = Yes; values = Yes | Yes** |
* You can .append() rows to a DataFrame, but this returns a new object, whereas adding columns does not!

Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 2 (Python Lists, 1300 possible points) and 4 (NumPy, 1400 possible points) from the Introduction to Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 2025 points (75%) of the total possible points. Note that when you take hints from a DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, July 1st to submit your assignment. There is no lecture that week but assignments are still due.
Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Here's a small example of list slicing and what we get back. Remember we talked about getting a view or copy back in section 5.5.4?
Let's recall that every object in Python is assigned an integer ID. You can find this with the id() function. When we assign a variable to an object, Python provides a link, or reference, back to the ID of that object. Python essentially uses IDs to assign the memory space where an object's value is stored. Python uses references to keep track of IDs so that when there are no more (0) references to an object's ID, Python can reclaim the memory where its value is held and reuse its ID for something else in the program.
Whenever we slice a list, it returns a copy of the references to those elements. Rather than copying each element to a new space in memory (with a new object ID), it uses the references to find where those individual objects live in memory. This can bring up some rather tricky behaviour when we have nested lists built from references to other list objects.
Depending on how you have sliced your original list, you may get back a direct reference (ID) to the original object, or references to the elements of the original list!
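Before the larger example below, here is the distinction in miniature, with a hypothetical two-element nested list:

```python
inner = ["genome", 20]
outer = [inner, "this"]

alias = outer[0]      # a direct reference to the inner list itself
sliced = outer[0][:]  # a new list holding copies of the element references

inner[0] = "genomic"  # mutate the original inner list

alias   # sees the change, because it IS the inner list
sliced  # unaffected, because its references were copied before the change
```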
# A crazy example of how list references work!
list_1 = ["genome", 20, 30.5, True]
list_2 = [list_1, "this", "that", "those"]
# reference the first 3 elements of list_2 (a list, and two strings)
list_3 = list_2[0:3]
# reference the first 4 elements of list_1 (4 different references)
list_4 = list_1[0:4]
# reference the first 4 elements of the first element in list_2 (4 different references)
list_5 = list_2[0][0:4]
# reference the first element of list_2
list_6 = list_2[0]
# Copy all the references of list_2
list_7 = list_2[:]
# make a shallow copy of list_2
list_8 = list_2.copy()
# Change the first element of list_1
list_1[0] = "genomic"
# What is the resulting output?
list_1
list_2
list_3
list_4
list_5
list_6
list_7
list_8
# Look at the corresponding IDs of these variables!
id(list_1)
id(list_2)
id(list_3)
id(list_4)
id(list_5)
id(list_6)
# What is list_7? Is it its own object?
id(list_7)
# What about the first element of list 7?
id(list_7[0])
Let's take a look at what just happened so we can break down the code in a bit more detail.
list_1 = ["genome", 20, 30.5, True]: we generate a list with 4 elements.
list_2 = [list_1, "this", "that", "those"]: we generate a second list of 4 elements with the first element being a list itself
list_3 = list_2[0:3]: make a list copying the first 3 references to list_2; this includes the reference to the list_1 object.
list_4 = list_1[0:4]: make a list by copying references for the first 4 elements in list_1; we do not copy the reference to list_1 object itself.
list_5 = list_2[0][0:4]: make a list by copying references for the first 4 elements in the first element of list_2; it should be very similar to list_4.
list_6 = list_2[0]: copy the reference to the first element of list_2 which is the reference to the list_1 object.
list_7 = list_2[:]: copy all of the references to the elements of list_2 which includes a reference to the list_1 object.
Changing list_1¶list_1[0] = "genomic": Now we've changed the first element of list_1. Any other objects directly referencing list_1 will propagate this change. Objects that merely reference the elements of list_1 will not.

Here's a summary of the list objects and their theoretical output:
| Object | Direct reference to list_1? | Example object ID | Contents at assignment | Contents after changing list_1 |
|---|---|---|---|---|
| list_1 | Yes, of course! | 1 | ['genome', 20, 30.5, True] | ['genomic', 20, 30.5, True] |
| list_2 | Yes, at index = 0 | 2 | [['genome', 20, 30.5, True], 'this', 'that', 'those'] | [['genomic', 20, 30.5, True], 'this', 'that', 'those'] |
| list_3 | Yes, at index = 0 | 3 | [['genome', 20, 30.5, True], 'this', 'that'] | [['genomic', 20, 30.5, True], 'this', 'that'] |
| list_4 | No | 4 | ['genome', 20, 30.5, True] | ['genome', 20, 30.5, True] |
| list_5 | No | 5 | ['genome', 20, 30.5, True] | ['genome', 20, 30.5, True] |
| list_6 | Yes, a direct reference | 1 | ['genome', 20, 30.5, True] | ['genomic', 20, 30.5, True] |
| list_7 | Yes, at index = 0 | 6 | [['genome', 20, 30.5, True], 'this', 'that', 'those'] | [['genomic', 20, 30.5, True], 'this', 'that', 'those'] |
So we've taken a close look at lists and how they handle slicing. To avoid getting a direct reference (view) to a simple list object when using the = assignment operator, you can use [:] or the .copy() method to copy all of the element references. When working with nested lists, which contain references to other lists, however, beware of what you'll get.
From our above examples, what happens if we assign list_6[0] = "GENOMIC"? Can you guess?
Now that we've covered lists and their potential complications, should we take a closer look at NumPy arrays? Similar in concept to lists, recall that these objects are implemented in a package developed independently of core Python.
# The double squared bracket is required, otherwise you get a "type not understood" traceback
# First build our 2D array: 2 rows, 3 columns
array_1 = np.array([[5, 16, 23],
[3, 19, 15]])
# Slice all of the elements and assign to array_2
array_2 = array_1[:, :]
# Slice a subset of array_1. Is it a copy of the references?
array_3 = array_2[0:2, 0:2]
# Assign array_3 to a new name. Is this a copy?
array_4 = array_3
# Make a shallow copy of array_1
array_5 = array_1.copy()
# Change the first element of array_4
array_4[0,0] = 50
# Multiply all of array_3 with 2
array_3 = array_3 * 2
# What do array_2 and array_3 look like?
array_2
array_3
array_4
array_5
[ ] returns a reference to the original array¶So, the way arrays handle slicing is different from how lists handle the same commands. From the simple example above, we see that when you slice an array with [] you are always returned a reference, or view, of that array. That being said, altering an array through a slice's reference can be quirky.
array_4[0,0] = 50 is a direct slicing call with an assignment, so the change is propagated to array_1 and array_2.
array_3 = array_3 * 2 does not use any slicing notation. Instead, the Python interpreter makes a new copy of array_3, completes the math on the array, and then assigns this to the variable array_3. The reference that array_3 had to array_1 is released and replaced with one for this new object.
These are intentional behaviours of the array object. If a reference to an array is not what you want, then you can use the .copy() method to generate a shallow copy of the array. Any changes in the original will not be propagated to the copy, and vice versa!
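The view-versus-copy behaviour can be sketched with a hypothetical 2x2 array (not the lecture's array_1):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

view = a[:, :]    # slicing returns a view onto a's data
indep = a.copy()  # .copy() returns an independent array

view[0, 0] = 50        # slice assignment propagates to a
view[:, :] = view * 2  # in-place broadcast assignment also propagates to a

a      # [[100, 4], [6, 8]]: both changes reached the original
indep  # [[1, 2], [3, 4]]: the copy is untouched
```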
What do you think array_3[:,:] = array_3 * 2 would do?
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.